Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two, modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused via a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision captures the global status of the robot but often suffers from occlusion, audio provides immediate feedback on key moments that may not even be visible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
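As a rough illustration of such a fusion module, here is a minimal PyTorch sketch (our own example, not the authors' released system) that projects per-modality features into a shared embedding space and fuses them with a standard self-attention encoder layer; the feature dimensions, single encoder layer, and 7-DoF action head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultisensoryFusion(nn.Module):
    """Minimal sketch: fuse vision, audio, and touch features with self-attention.
    Feature dimensions (512/128/256), the single encoder layer, and the 7-dim
    action head are illustrative assumptions, not the authors' architecture."""
    def __init__(self, vision_dim=512, audio_dim=128, touch_dim=256, d_model=256):
        super().__init__()
        self.proj_vision = nn.Linear(vision_dim, d_model)
        self.proj_audio = nn.Linear(audio_dim, d_model)
        self.proj_touch = nn.Linear(touch_dim, d_model)
        # Learnable modality embeddings tell the attention layer which token is which.
        self.modality_emb = nn.Parameter(torch.zeros(3, d_model))
        self.encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, 7)  # e.g. a 7-DoF action; output size is a placeholder

    def forward(self, vision_feat, audio_feat, touch_feat):
        tokens = torch.stack([
            self.proj_vision(vision_feat),
            self.proj_audio(audio_feat),
            self.proj_touch(touch_feat),
        ], dim=1) + self.modality_emb          # (B, 3, d_model)
        fused = self.encoder(tokens)           # self-attention across the three modalities
        return self.head(fused.mean(dim=1))    # pool tokens and predict an action

# Usage with random features standing in for real encoder outputs.
model = MultisensoryFusion()
action = model(torch.randn(2, 512), torch.randn(2, 128), torch.randn(2, 256))
print(action.shape)  # torch.Size([2, 7])
```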
A learning-based adaptive in-loop filter is developed for the Geometry-based Point Cloud Compression (G-PCC) standard to reduce attribute compression artifacts. The proposed method first generates multiple Most-Probable Sample Offsets (MPSOs) as potential approximations of the compression distortion, and then linearly weights them to mitigate the artifacts, thereby driving the filtered reconstruction as close as possible to the uncompressed point cloud attributes (PCAs). To this end, we design a Compression Artifact Reduction Network (CARNet) consisting of two consecutive processing stages: MPSOs derivation and MPSOs combination. The MPSOs derivation stage uses a two-stream network to model local neighborhood variations from a direct spatial embedding and a frequency-dependent embedding, where sparse convolutions are exploited to best aggregate information from sparsely and irregularly distributed points. The MPSOs combination stage is guided by a least-squares error metric to further capture the content dynamics of the input PCAs, from which the weighting coefficients are derived. CARNet is implemented as an in-loop filtering tool of G-PCC, where the linear weighting coefficients are encapsulated in the bitstream with negligible bitrate overhead. Experimental results demonstrate significant subjective and objective improvements over the latest G-PCC.
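To make the weighting step concrete, the NumPy sketch below (our own toy example with synthetic data, not part of CARNet) shows how several candidate sample offsets can be combined with least-squares-derived linear weights so that the filtered reconstruction moves closer to the original attributes; in CARNet the offsets come from a learned two-stream network and the coefficients are signaled in the bitstream.

```python
import numpy as np

# Toy setting: N reconstructed attribute values and K candidate sample offsets (MPSOs).
rng = np.random.default_rng(0)
N, K = 1000, 4
original = rng.uniform(0, 255, size=N)                  # uncompressed attributes (known at the encoder)
reconstructed = original + rng.normal(0, 6.0, size=N)   # decoded attributes with compression noise
distortion = reconstructed - original
# Each MPSO is a noisy estimate of the negated distortion, standing in for
# what a learned derivation stage might predict.
offsets = -distortion[:, None] + rng.normal(0, 3.0, size=(N, K))

# Least-squares weights w minimizing ||original - (reconstructed + offsets @ w)||^2.
w, *_ = np.linalg.lstsq(offsets, original - reconstructed, rcond=None)

filtered = reconstructed + offsets @ w
print("weights:", np.round(w, 3))
print("MSE before filtering:", np.mean((original - reconstructed) ** 2))
print("MSE after filtering: ", np.mean((original - filtered) ** 2))
```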
"Monkey see, monkey do" is an age-old adage, referring to naive imitation without a deep understanding of a system's underlying mechanisms. Indeed, if the demonstrator has access to information unavailable to the imitator (monkey), such as a different set of sensors, then no matter how perfectly the imitator models its perceived environment (see), attempting to reproduce the demonstrator's behavior (do) can lead to poor outcomes. Imitation learning under a mismatch between the demonstrator and the imitator has been studied in the literature on causal imitation learning (Zhang et al., 2020), but existing solutions are limited to single-stage decision making. This paper studies the problem of causal imitation learning in sequential settings, where the imitator must make multiple decisions per episode. We formulate a graphical criterion that is necessary for determining the feasibility of causal imitation, providing conditions under which the imitator can match the demonstrator's performance despite differing capabilities. Finally, we provide an efficient algorithm for determining imitability and corroborate our theory with simulations.
One of the common ways children learn is by imitating adults. Imitation learning focuses on learning policies from demonstrations generated by an expert, without a specified performance metric and without observed reward signals. Popular approaches to imitation learning either directly mimic the expert's behavior policy (behavioral cloning) or learn a reward function that prioritizes the observed expert trajectories (inverse reinforcement learning). However, these methods rely on the assumption that the covariates the expert uses to determine its actions are fully observed. In this paper, we relax this assumption and study imitation learning when the sensory inputs of the learner and the expert differ. First, we provide a non-parametric, graphical criterion for determining the feasibility of imitation from a combination of demonstration data and qualitative assumptions about the underlying environment, expressed in the form of a causal model. We then show that, when this criterion is not satisfied, imitation can still be feasible by exploiting quantitative knowledge of the expert trajectories. Finally, we develop an efficient procedure for learning an imitating policy from the expert's trajectories.
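For intuition about the kind of graphical test involved, the toy sketch below checks whether a candidate set of observed covariates d-separates the action from the outcome once the action's outgoing edges are removed, using networkx (the `d_separated` helper exists in networkx 2.4–3.2 and is renamed `is_d_separator` in newer releases). The toy graph and this particular backdoor-style check are illustrative assumptions, not the paper's exact criterion.

```python
import networkx as nx

# Toy causal diagram: U is latent and affects both the expert's action X and the outcome Y;
# Z is a covariate observed by the imitator that mediates U's influence on X.
G = nx.DiGraph([
    ("U", "Z"), ("Z", "X"), ("U", "Y"), ("X", "Y"),
])

def action_outcome_separated(graph, action, outcome, covariates):
    """Check whether `covariates` d-separate the action from the outcome
    once the edges leaving the action are removed (a backdoor-style test)."""
    g = graph.copy()
    g.remove_edges_from(list(g.out_edges(action)))
    return nx.d_separated(g, {action}, {outcome}, set(covariates))

print(action_outcome_separated(G, "X", "Y", {"Z"}))   # True: imitating based on Z suffices in this toy graph
print(action_outcome_separated(G, "X", "Y", set()))   # False: ignoring Z leaves the backdoor path open
```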
Recovering a textured 3D mesh from a monocular image is highly challenging, especially for in-the-wild objects that lack 3D ground truth. In this work, we present MeshInversion, a new framework that exploits the generative prior of a 3D GAN pre-trained for textured 3D mesh synthesis. Reconstruction is achieved by searching the latent space of the 3D GAN for a sample that best resembles the target mesh. Since the pre-trained GAN encapsulates rich 3D semantics in terms of mesh geometry and texture, searching within the GAN manifold naturally regularizes the realism and fidelity of the reconstruction. Importantly, this regularization is applied directly in 3D space, providing crucial guidance for mesh parts that are unobserved in 2D space. Experiments on standard benchmarks show that our framework obtains faithful 3D reconstructions with consistent geometry and texture in both observed and unobserved parts. Moreover, it generalizes well to less common meshes, such as the extended articulations of deformable objects. Code is released at https://github.com/junzhezhang/mesh-inversion
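The sketch below shows a generic GAN-inversion loop of the kind described: optimize a latent code so that the rendered output of a frozen, pre-trained 3D generator matches the target observation. The `generator` and `render` callables and the plain MSE objective are placeholders of our own; the released code uses a textured-mesh GAN and a differentiable renderer with richer losses.

```python
import torch

def invert_into_gan(generator, render, target_image, latent_dim=256, steps=500, lr=0.05):
    """Generic GAN-inversion loop (our own sketch, not the MeshInversion release):
    optimize a latent code so that the rendered output of a frozen pre-trained
    generator matches the target image."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        mesh = generator(z)                     # textured mesh from the frozen 3D GAN
        rendered = render(mesh)                 # differentiable rendering to 2D
        loss = torch.nn.functional.mse_loss(rendered, target_image)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z.detach())

# Toy stand-ins so the sketch runs end to end: the "generator" maps z to a 4x4 "image"
# and the "renderer" is a reshape; in practice these would be the pre-trained 3D GAN
# and a differentiable mesh renderer.
toy_gen = torch.nn.Linear(256, 16).requires_grad_(False)
result = invert_into_gan(lambda z: toy_gen(z), lambda m: m.view(1, 4, 4),
                         target_image=torch.randn(1, 4, 4), steps=50)
print(result.shape)  # torch.Size([1, 16])
```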
Chamfer Distance (CD) and Earth Mover's Distance (EMD) are two widely adopted metrics for measuring the similarity between two point sets. However, CD is usually insensitive to mismatched local densities, while EMD is usually dominated by the global distribution and ignores the fidelity of detailed structures. In addition, their unbounded value ranges induce a heavy influence from outliers. These defects prevent them from providing consistent evaluations. To address these problems, we propose a new similarity measure named Density-aware Chamfer Distance (DCD). It is derived from CD and benefits from several desirable properties: 1) it can detect differences in density distributions and is thus a stricter measure of similarity than CD; 2) it is stricter with detailed structures and significantly more computationally efficient than EMD; 3) its bounded value range facilitates more stable and reasonable evaluation over the whole test set. We adopt DCD to evaluate point cloud completion, and experimental results show that DCD attends to both the overall structure and local geometric details, providing a more reliable evaluation even when CD and EMD contradict each other. We can also use DCD as a training loss, which outperforms the same model trained with the CD loss on all three metrics. Furthermore, we propose a new point discriminator module that estimates priorities for an additional guided down-sampling step, achieving noticeable improvements under DCD together with competitive results under CD and EMD. We hope our work can pave the way for a more comprehensive and practical evaluation of point cloud similarity. Our code will be available at https://github.com/wutong16/Density_aware_Chamfer_Distance.
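For intuition, here is a small PyTorch sketch of a density-aware, Chamfer-style measure in the spirit described above: each point's nearest-neighbor term is bounded by an exponential and discounted by how many query points share that nearest neighbor, so regions with mismatched local density are penalized. This is our own approximation reconstructed from the description; the exact definition is in the linked repository.

```python
import torch

def density_aware_chamfer(x, y, alpha=1.0):
    """Illustrative density-aware Chamfer-style distance between point sets
    x: (N, 3) and y: (M, 3). Our own sketch: bounded per-point terms,
    discounted when many points share the same nearest neighbor."""
    dist = torch.cdist(x, y)                      # (N, M) pairwise Euclidean distances
    d_xy, idx_xy = dist.min(dim=1)                # for each x: distance to nearest y and its index
    d_yx, idx_yx = dist.min(dim=0)                # for each y: distance to nearest x and its index

    # How many x-points picked each y as their nearest neighbor (and vice versa).
    count_y = torch.bincount(idx_xy, minlength=y.shape[0]).clamp(min=1).float()
    count_x = torch.bincount(idx_yx, minlength=x.shape[0]).clamp(min=1).float()

    term_xy = (1.0 - torch.exp(-alpha * d_xy) / count_y[idx_xy]).mean()
    term_yx = (1.0 - torch.exp(-alpha * d_yx) / count_x[idx_yx]).mean()
    return 0.5 * (term_xy + term_yx)

# A uniformly sampled set scores better than one covering the same region with clumped density.
ref = torch.rand(512, 3)
print(density_aware_chamfer(ref, ref))                    # ~0 for identical sets
print(density_aware_chamfer(ref, ref[:128].repeat(4, 1))) # higher: mismatched local density
```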
With the advancement of machine learning (ML) and its growing awareness, many organizations that own data but lack ML expertise (data owners) would like to pool their data and collaborate with those who have the expertise but need data from diverse sources in order to train truly generalizable models (model owners). In such collaborative ML, the data owners want to protect the privacy of their training data, while the model owners desire confidentiality of the model and of the training method, which may contain intellectual property. However, existing private ML solutions, such as federated learning and split learning, cannot simultaneously satisfy the privacy requirements of both data owners and model owners. This paper presents Citadel, a scalable collaborative ML system that protects the privacy of both data owners and model owners on untrusted infrastructure, built on Intel SGX. Citadel performs distributed training across multiple training enclaves running on behalf of the data owners and the model owner. Citadel further establishes a strong information barrier between these enclaves by means of zero-sum masking and hierarchical aggregation to prevent data/model leakage during collaborative training. Compared with existing SGX-protected training systems, Citadel achieves better scalability and stronger privacy guarantees for collaborative ML. Cloud deployments with various ML models show that Citadel scales to a large number of enclaves with less than 1.73x slowdown caused by SGX.
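As a toy illustration of the zero-sum masking idea (our own NumPy sketch, not Citadel's SGX implementation), each training enclave adds a random mask to its model update such that the masks cancel in aggregate: the aggregator only ever sees masked updates, yet their sum equals the sum of the true updates. In a real deployment the masks would be derived from secrets shared between enclaves; here they are constructed centrally for simplicity.

```python
import numpy as np

rng = np.random.default_rng(42)
num_enclaves, dim = 4, 8

# Each enclave computes a local model update (e.g. a gradient) on its private data.
updates = [rng.normal(size=dim) for _ in range(num_enclaves)]

# Generate masks that sum to zero: the last mask is the negative sum of the others.
masks = [rng.normal(size=dim) for _ in range(num_enclaves - 1)]
masks.append(-np.sum(masks, axis=0))

# The aggregator only ever sees masked updates...
masked = [u + m for u, m in zip(updates, masks)]
# ...yet their sum equals the sum of the true updates, so aggregation is unaffected.
assert np.allclose(np.sum(masked, axis=0), np.sum(updates, axis=0))
print(np.sum(masked, axis=0))
```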
Recognizing useful named entities plays a vital role in medical information processing and helps drive research in the medical domain. Deep learning methods have achieved good results in medical named entity recognition (NER). However, we find that existing methods face great challenges when dealing with nested named entities. In this work, we propose a novel method, referred to as ASAC, to resolve the difficulties caused by nesting; the core idea is to model the dependencies between different categories of entity recognition. The proposed method contains two key modules: the adaptive shared (AS) part and the attentive conditional random field (ACRF) module. The former automatically assigns adaptive weights across tasks to achieve optimal recognition accuracy in the multi-layer network. The latter employs an attention operation to model the dependencies between different entities. In this way, our model learns better entity representations by capturing the implicit distinctions and relationships between different categories of entities. Extensive experiments on public datasets verify the effectiveness of our method. In addition, we perform ablation analyses to understand it in depth.
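One plausible reading of the adaptive shared (AS) part is a per-task learnable mixture over the layers of a shared encoder, sketched below in PyTorch; this is our own illustrative interpretation, not the authors' exact design, and the layer counts and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class AdaptiveSharedMix(nn.Module):
    """Illustrative sketch of an adaptive shared (AS) module: each task learns
    softmax weights over the layers of a shared encoder. One plausible reading
    of the description above, not the authors' exact design."""
    def __init__(self, num_layers, num_tasks):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_tasks, num_layers))

    def forward(self, layer_outputs, task_id):
        # layer_outputs: (num_layers, batch, seq_len, hidden)
        w = torch.softmax(self.layer_logits[task_id], dim=0)      # (num_layers,)
        return (w.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)   # task-specific mixture

# Toy usage: 4 shared layers, 2 entity-category tasks.
mix = AdaptiveSharedMix(num_layers=4, num_tasks=2)
outputs = torch.randn(4, 2, 10, 32)        # stacked hidden states from a shared encoder
flat_task = mix(outputs, task_id=0)        # representation fed to one task-specific head
nested_task = mix(outputs, task_id=1)      # representation fed to the other head
print(flat_task.shape, nested_task.shape)  # torch.Size([2, 10, 32]) twice
```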
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot, or can only marginally, benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the student's depth mismatches the teacher's; 3) weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification for ViT-Tiny, ViT-Small, and ViT-Base, with gains of +4.2%/+2.4%/+1.4%, respectively. Our base-size TinyMIM model achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 mIoU higher than the MAE baseline. Our tiny-size TinyMIM model achieves 79.6% top-1 accuracy on ImageNet-1K image classification, setting a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way to develop small vision Transformer models: exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
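A minimal sketch of what distilling token relations can look like in practice (our own example, not the released TinyMIM code): the student is trained to match the teacher's token-to-token similarity map rather than individual token features, here via a KL divergence over row-wise softmaxed cosine similarities.

```python
import torch
import torch.nn.functional as F

def token_relation_distill_loss(student_tokens, teacher_tokens, tau=1.0):
    """Match the student's token-to-token relation map to the teacher's.
    student_tokens, teacher_tokens: (batch, num_tokens, dim); widths may differ."""
    def relations(tokens):
        t = F.normalize(tokens, dim=-1)
        sim = t @ t.transpose(1, 2) / tau              # (batch, N, N) cosine similarities
        return F.log_softmax(sim, dim=-1)
    s_rel = relations(student_tokens)
    with torch.no_grad():
        t_rel = relations(teacher_tokens).exp()        # teacher relations as probabilities
    return F.kl_div(s_rel, t_rel, reduction="batchmean")

# Toy usage: teacher and student can have different embedding widths.
loss = token_relation_distill_loss(torch.randn(2, 196, 192), torch.randn(2, 196, 768))
print(loss)
```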
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, while its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains highly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
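The sketch below illustrates the flavor of this implicit alignment (our own PyTorch toy, not the CMT release): camera and LiDAR tokens each receive a positional embedding computed from associated 3D coordinates, are concatenated, and are attended by object queries in a transformer decoder. The coordinate MLP, query count, and box parameterization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossModalTokens(nn.Module):
    """Sketch of implicit multi-modal alignment: both camera tokens and LiDAR tokens
    receive a positional embedding computed from associated 3D coordinates, so a
    transformer decoder can attend to them jointly without an explicit view transform."""
    def __init__(self, d_model=256):
        super().__init__()
        self.coord_mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.decoder = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(1, 100, d_model))   # object queries
        self.box_head = nn.Linear(d_model, 7)                       # placeholder box parameterization

    def forward(self, image_tokens, image_coords, point_tokens, point_coords):
        # Add 3D-coordinate positional encodings to each modality's tokens, then concatenate.
        memory = torch.cat([
            image_tokens + self.coord_mlp(image_coords),
            point_tokens + self.coord_mlp(point_coords),
        ], dim=1)
        queries = self.queries.expand(memory.shape[0], -1, -1)
        decoded = self.decoder(queries, memory)
        return self.box_head(decoded)                               # (B, 100, 7) candidate boxes

# Toy usage with random tokens: 600 image tokens and 400 point tokens in a shared 3D frame.
m = CrossModalTokens()
boxes = m(torch.randn(2, 600, 256), torch.randn(2, 600, 3),
          torch.randn(2, 400, 256), torch.randn(2, 400, 3))
print(boxes.shape)  # torch.Size([2, 100, 7])
```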